BUDT704 Python Project Group 9: The Smithbusters¶

"Alone we can do so little; together we can do so much." — Helen Keller

Introduction¶

Who we are¶

The Smithbusters are a group of aspiring MSIS graduate students who are enthusiastic about data and want to emphasize the importance of data analytics in today's data-driven world. We are ready to apply the knowledge we have gained as data analysts, both inside and outside the University of Maryland, to make a difference in our communities.

The Smithbusters are Rishikesh Baskaran, Yash Makadia, Satvik Narang, Anushka Ranjan and Shweta Salelkar.

What we are doing¶

The Smithbusters will produce visualizations using data from the US Census to identify places that require assistance in any of the characteristics we study, such as education and employment. We will make predictions about how the criteria we investigate impact the income and livelihood of Americans in various ways. The Smithbusters intend to answer the following questions with this data.

In order to build this index, we will need to evaluate answers to the following questions:

  1. What is the median household income in the United States? How does median income vary across states, and what is the distribution of employment types within each state?
  2. How does the distribution of education levels and employment within a state's population impact its income?
  3. Can a model be trained to understand the effect of each factor on income and estimate the income for a given state based on these factors?
  4. Based on the income predicted by the model, does a state have an opportunity to improve its median household income, and which factors can help achieve this?

Why these questions matter:¶

Part 1¶

Learning which states are doing well and which fall short of the national average requires knowledge of the median income of American families and how it varies from state to state. Our research will focus on high-income states to learn the employment structure that has contributed to their success. The states trailing behind then become our focus in the later parts of our report.

Part 2¶

A popular belief is that better education leads to more job opportunities and a higher income. We're here to find out if that's true by comparing the levels of education in the state with the highest income and the state with the lowest. If education and income turn out to be directly related, education becomes a straightforward factor to focus on.

Part 3¶

We will need a working model to put our analysis to use and to tell us immediately, from the predicted income, whether a state has room for improvement.
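As a toy sketch of the kind of model we have in mind (not our final model; all numbers below are made up for illustration), a linear model can be fit with ordinary least squares and used to estimate income for an unseen state:

```python
import numpy as np

# Hypothetical per-state factors: [share with a bachelor's degree, unemployment rate]
X = np.array([[0.30, 0.04],
              [0.22, 0.06],
              [0.35, 0.03],
              [0.25, 0.05]])
y = np.array([75000, 58000, 82000, 63000])  # hypothetical median incomes

# Add an intercept column and solve for the coefficients with least squares
X1 = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Estimate income for a new hypothetical state with the fitted coefficients
new_state = np.array([1.0, 0.28, 0.045])
predicted = new_state @ coef
```

The fitted coefficients quantify the effect of each factor on income, which is exactly the interpretation Part 4 relies on.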

Part 4¶

When our model shows that there is room for change, we will need to know which factors to work on first and how important each one is, so that organizations can prioritize accordingly.

How will we accomplish this? (Description of dataset)¶

Our data was obtained from the US Census data website (https://data.census.gov/). We obtained information for the year 2020 on the factors below:

Table Name -> Description
DP03 -> Economic characteristics
DP05 -> Demographic and housing estimates
S1501 -> Educational attainment
S2506 -> Financial characteristics for housing units

Each of these tables contains data for every US ZIP code. To join the relevant columns, we use the ZIP code as the primary key and merge the tables above into a consolidated dataset.
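A toy sketch of this join (hypothetical values; the actual merge over all four tables appears in the Data Gathering cells):

```python
import pandas as pd

# Two miniature tables that share the ZIP code key
econ = pd.DataFrame({'ZipCode': ['00601', '00602'], 'Income': [14398, 16771]})
edu = pd.DataFrame({'ZipCode': ['00601', '00602'], 'HighSchoolGrad': [3316, 7103]})

# Inner join on the shared key: one row per ZIP code, columns from both tables
merged = pd.merge(econ, edu, on='ZipCode')
```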

We chose columns based on how they relate to our objective: we needed to understand which factors affect an individual's income and how that income affects their living conditions. DP03 gave us data on income and income characteristics for the population, and DP05 covered population characteristics and ethnicity. We used S1501 to understand levels of educational attainment, that is, how many people in a city were educated up to high school, college, and undergraduate level. S2506 gave us two columns: the number of mortgaged housing units and their median value.

Choice for Heavier Grading - Data Processing¶

The Smithbusters believe we should be graded more heavily on our data processing. To perform our analysis and build the community needs index (CNI), we had to carry out a number of cleaning, processing, and transformation activities on the dataset.

Besides working with a high-dimensional dataset, we had to ensure that the relevant factors were considered, which required creating a new dataset by merging multiple tables.

We also created high-level factors needed by the community (such as jobs) by assigning weights to sub-factors (such as public, private, and self-employed work), and converted the absolute numbers for these factors into proportions so that geographies can be compared.

Data Gathering¶

In [1]:
# Importing all the required libraries 

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import numpy as np
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly.express as px
import warnings
warnings.filterwarnings("ignore")
In [2]:
# Reading the data reference csv containing list of columns and source tables to be used from the US Census website
df1 = pd.read_csv(r'.\data\DataReference.csv')
In [3]:
#creating a list of respective columns to be filtered from each of the source tables

cols_DP03=df1[df1['TabName'].str.contains('DP03')].values[:,1].tolist()
cols_DP05=df1[df1['TabName'].str.contains('DP05')].values[:,1].tolist()
cols_S2506=df1[df1['TabName'].str.contains('S2506')].values[:,1].tolist()
cols_S1501=df1[df1['TabName'].str.contains('S1501')].values[:,1].tolist()
In [4]:
# The 'NAME' column will be used as the key to join the data 

cols_DP03.insert(0,'NAME')
cols_DP05.insert(0,'NAME')
cols_S2506.insert(0,'NAME')
cols_S1501.insert(0,'NAME')
In [5]:
# importing the csv files downloaded from US Census website

df_DP05=pd.read_csv(r'.\data\ACSDP5Y2020.DP05-Data.csv', dtype='str',header=0)
df_DP03=pd.read_csv(r'.\data\ACSDP5Y2020.DP03-Data.csv', dtype='str',header=0)
df_S2506=pd.read_csv(r'.\data\ACSST5Y2020.S2506-Data.csv', dtype='str',header=0)
df_S1501=pd.read_csv(r'.\data\ACSST5Y2020.S1501-Data.csv', dtype='str',header=0)
In [6]:
# The first data row repeats the column descriptions (a second header row), so drop it

df_DP05.drop([0], axis=0, inplace=True)
df_DP03.drop([0], axis=0, inplace=True)
df_S2506.drop([0], axis=0, inplace=True)
df_S1501.drop([0], axis=0, inplace=True)
In [7]:
# The final dataframe consists of the individual dataframes joined on the key 'NAME' (which stores the zip code)

df_merged=pd.merge(df_DP03[cols_DP03], df_DP05[cols_DP05], on='NAME')
df_merged=pd.merge(df_merged,df_S2506[cols_S2506], on='NAME')
df_merged=pd.merge(df_merged,df_S1501[cols_S1501], on='NAME')

Data Processing¶

The data we collected from the four tables was high-dimensional. The table below lists the four source tables. The number of rows in each table was 33,122, which is the number of ZIP codes in America.

Sr-- Table -- Description
1 -- DP03 -- Economic characteristics
2 -- DP05 -- Demographic and housing estimates
3 -- S1501 -- Educational attainment
4 -- S2506 -- Financial characteristics for housing units

As part of our analysis, the final data frame created after merging the relevant columns consists of 33,120 rows and 40 columns. In the next step, we will perform the following activities:

1. Renaming the columns:
The column names imported from the CSV files from the US Census website are in a coded format (e.g., DP05_0001E). Using the metadata information (also available from the website), we rename the columns according to the descriptions provided in the metadata.

2. Extracting Zip code values using regex:
The column 'NAME' contains the ZIP code prefixed with a ZIP Code Tabulation Area code (e.g., ZCTA5 00601). We use a regular expression to extract the five numeric digits from this column, which represent the ZIP code.

3. Adding City and State information with Zipcode:
We will add city and state information for each ZIP code in order to evaluate the mean of different parameters (such as income and education) at a geographical level and compare the results with the ZIP code. This will help in identifying how a particular ZIP code is performing when compared to the city or state average.

4. Identification of columns with null/inconsistent values:
For our analysis, we came up with relevant factors such as average income, education, and financial details per ZIP code. From this list, we identified two factors, 'Poverty' and 'ChildPoverty', for which no data was available for any ZIP code. These columns were therefore dropped from the analysis.

5. Identification of rows with null/inconsistent values:
We analyzed the column 'Income' and found 1,669 ZIP codes containing inconsistent values such as '*' or '-'. This amounts to ~5% of the rows, so we decided to exclude them from our analysis by dropping them.

6. Data type conversion for numeric columns:
In order to perform aggregations such as state-level averages across ZIP codes, we need to convert the columns from their imported type (string) to integers.

7. Transformation of absolute values to percentages:
In order to compare factors such as education and employment across different geographical areas, we transform the absolute values (which represent the estimated count of people for a given factor) into proportions by dividing by the total population of that area.

8. Merging sub-factors to create factors using weighted averages:
Our last step in managing our data is to combine the multiple sub-factors of a category into a single factor using weighted calculations. Each state's data will then be compared with the national average to determine whether the factors it lacks in need to be improved.
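As a toy illustration of steps 7 and 8 (made-up numbers; the actual column set and weights appear in the cells below), counts are first converted to percentages of total population and then collapsed into one factor with equal weights:

```python
import pandas as pd

# Two hypothetical areas with raw counts for three job sub-factors
df = pd.DataFrame({'TotalPop': [1000, 2000],
                   'PrivateWork': [400, 900],
                   'PublicWork': [100, 150],
                   'SelfEmployed': [50, 120]})

sub = ['PrivateWork', 'PublicWork', 'SelfEmployed']
# Step 7: absolute counts -> percentage of total population
pct = df[sub].div(df['TotalPop'], axis=0) * 100

# Step 8: weighted combination of sub-factors into a single 'Jobs' factor
weights = {'PrivateWork': 1/3, 'PublicWork': 1/3, 'SelfEmployed': 1/3}
df['Jobs'] = sum(pct[c] * w for c, w in weights.items())
```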

In [8]:
# To check the name of all the columns of the data frame
df_merged.columns
Out[8]:
Index(['NAME', 'DP03_0062E', 'DP03_0087M', 'DP03_0088E', 'DP03_0088M',
       'DP03_0128E', 'DP03_0129E', 'DP03_0027E', 'DP03_0028E', 'DP03_0029E',
       'DP03_0034E', 'DP03_0031E', 'DP03_0019E', 'DP03_0020E', 'DP03_0021E',
       'DP03_0022E', 'DP03_0023E', 'DP03_0024E', 'DP03_0025E', 'DP03_0001E',
       'DP03_0047E', 'DP03_0048E', 'DP03_0049E', 'DP03_0050E', 'DP03_0005E',
       'DP05_0001E', 'DP05_0002E', 'DP05_0003E', 'DP05_0072E', 'DP05_0037E',
       'DP05_0065E', 'DP05_0066E', 'DP05_0067E', 'DP05_0068E', 'DP05_0087E',
       'S2506_C01_001E', 'S2506_C02_040E', 'S1501_C01_009E', 'S1501_C01_011E',
       'S1501_C01_015E'],
      dtype='object')
In [9]:
df_merged.shape
Out[9]:
(33120, 40)

1. Renaming the columns:¶

Now we will rename the columns based on the reference table. As there are 30+ columns, we convert the reference data frame into a dictionary to get the mapping of all the column codes to their descriptive names.

In [10]:
# To convert the reference data frame into a dictionary, with the coded name (TabName) as the key and the descriptive name (ColName) as the value
pd.Series(df1.ColName.values,index=df1.TabName).to_dict()
Out[10]:
{'DP05_0001E': 'TotalPop',
 'DP05_0002E': 'Men',
 'DP05_0003E': 'Women',
 'DP05_0072E': 'Hispanic',
 'DP05_0037E': 'White',
 'DP05_0065E': 'Black',
 'DP05_0066E': 'Native',
 'DP05_0067E': 'Asian',
 'DP05_0068E': 'Pacific',
 'DP05_0087E': 'Citizen',
 'DP03_0062E': 'Income',
 'DP03_0087M': 'IncomeErr',
 'DP03_0088E': 'IncomePerCap',
 'DP03_0088M': 'IncomePerCapErr',
 'DP03_0128E': 'Poverty',
 'DP03_0129E': 'ChildPoverty',
 'DP03_0027E': 'Professional',
 'DP03_0028E': 'Service',
 'DP03_0029E': 'Office',
 'DP03_0034E': 'Construction',
 'DP03_0031E': 'Production',
 'DP03_0019E': 'Drive',
 'DP03_0020E': 'Carpool',
 'DP03_0021E': 'Transit',
 'DP03_0022E': 'Walk',
 'DP03_0023E': 'OtherTransp',
 'DP03_0024E': 'WorkAtHome',
 'DP03_0025E': 'MeanCommute',
 'DP03_0001E': 'Employed',
 'DP03_0047E': 'PrivateWork',
 'DP03_0048E': 'PublicWork',
 'DP03_0049E': 'SelfEmployed',
 'DP03_0050E': 'FamilyWork',
 'DP03_0005E': 'Unemployment',
 'S2506_C01_001E': 'Housing with Mortgage',
 'S2506_C02_040E': 'Median Monthly housing costs',
 'S1501_C01_009E': 'High School Graduate',
 'S1501_C01_011E': 'Some College or Associates degree',
 'S1501_C01_015E': 'Bachelors degree or higher'}
In [11]:
# To rename the columns
df_merged.rename(columns = {'DP05_0001E': 'TotalPop',
 'DP05_0002E': 'Men',
 'DP05_0003E': 'Women',
 'DP05_0072E': 'Hispanic',
 'DP05_0037E': 'White',
 'DP05_0065E': 'Black',
 'DP05_0066E': 'Native',
 'DP05_0067E': 'Asian',
 'DP05_0068E': 'Pacific',
 'DP05_0087E': 'Citizen',
 'DP03_0062E': 'Income',
 'DP03_0087M': 'IncomeErr',
 'DP03_0088E': 'IncomePerCap',
 'DP03_0088M': 'IncomePerCapErr',
 'DP03_0128E': 'Poverty',
 'DP03_0129E': 'ChildPoverty',
 'DP03_0027E': 'Professional',
 'DP03_0028E': 'Service',
 'DP03_0029E': 'Office',
 'DP03_0034E': 'Construction',
 'DP03_0031E': 'Production',
 'DP03_0019E': 'Drive',
 'DP03_0020E': 'Carpool',
 'DP03_0021E': 'Transit',
 'DP03_0022E': 'Walk',
 'DP03_0023E': 'OtherTransp',
 'DP03_0024E': 'WorkAtHome',
 'DP03_0025E': 'MeanCommute',
 'DP03_0001E': 'Employed',
 'DP03_0047E': 'PrivateWork',
 'DP03_0048E': 'PublicWork',
 'DP03_0049E': 'SelfEmployed',
 'DP03_0050E': 'FamilyWork',
 'DP03_0005E': 'Unemployment',
 'S2506_C01_001E': 'Housing with Mortgage',
 'S2506_C02_040E': 'Median Monthly housing costs',
 'S1501_C01_009E': 'High School Graduate',
 'S1501_C01_011E': 'Some College or Associates degree',
 'S1501_C01_015E': 'Bachelors degree or higher'
                             },inplace=True)
In [12]:
# To verify the renamed columns of the data frame
df_merged.head()
Out[12]:
NAME Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office ... Black Native Asian Pacific Citizen Housing with Mortgage Median Monthly housing costs High School Graduate Some College or Associates degree Bachelors degree or higher
0 ZCTA5 00601 14398 2379 7256 639 (X) (X) 1005 1088 587 ... 427 102 2 0 13421 596 747 3316 1290 1868
1 ZCTA5 00602 16771 2297 9981 594 (X) (X) 2828 2455 2654 ... 6708 4740 17 0 30430 1908 855 7103 3061 5575
2 ZCTA5 00603 15786 3520 11794 1030 (X) (X) 3654 2615 2940 ... 2026 300 74 14 36770 3528 814 9627 2558 7653
3 ZCTA5 00606 14980 2954 6901 862 (X) (X) 150 316 237 ... 168 22 0 0 5007 174 549 1385 201 482
4 ZCTA5 00610 20167 3137 10643 832 (X) (X) 1787 1942 1910 ... 4229 2926 9 1 21725 1868 733 5966 2166 3292

5 rows × 40 columns

2. Extracting Zip code values using regex:¶

The 'NAME' column contains the ZIP code prefixed with the ZIP Code Tabulation Area code (e.g., ZCTA5 00601). We use a regular expression to extract the five numeric digits, which represent the ZIP code.

In [13]:
# To extract the Zip Code from the NAME column and giving it a new column name 'ZipCode'
df_merged['ZipCode'] = df_merged['NAME'].str.findall(r'\d{5}')
In [14]:
# To get the first element of the list of the Zip Code column
df_merged.ZipCode = df_merged.ZipCode.map(lambda x: x[0])

3. Adding City and State information with Zipcode:¶

We add the city and state information for each ZIP code in order to evaluate the mean of different parameters (such as income and education) at a geographical level and compare the results with the ZIP code. This will help in identifying how a particular ZIP code is performing when compared to the city or state average.

In [15]:
# To add city and state information and map with the zip code
city_state_mapping_df = pd.read_csv(r'.\data\uszips.csv', dtype='str', header=0)
In [16]:
# To map the Zip code, city and state name
city_state_mapping_df = city_state_mapping_df[['zip','city','state_name']]
In [17]:
# To rename the Zip Code, State and City columns
city_state_mapping_df.rename(columns = {'zip':'ZipCode', 'city':'City',  'state_name':'State' }, inplace = True)
In [18]:
# To merge city and state based on the Zip code
df_merged = pd.merge(df_merged , city_state_mapping_df, on='ZipCode')
In [19]:
# To display the first five rows
df_merged.head()
Out[19]:
NAME Income IncomeErr IncomePerCap IncomePerCapErr Poverty ChildPoverty Professional Service Office ... Pacific Citizen Housing with Mortgage Median Monthly housing costs High School Graduate Some College or Associates degree Bachelors degree or higher ZipCode City State
0 ZCTA5 00601 14398 2379 7256 639 (X) (X) 1005 1088 587 ... 0 13421 596 747 3316 1290 1868 00601 Adjuntas Puerto Rico
1 ZCTA5 00602 16771 2297 9981 594 (X) (X) 2828 2455 2654 ... 0 30430 1908 855 7103 3061 5575 00602 Aguada Puerto Rico
2 ZCTA5 00603 15786 3520 11794 1030 (X) (X) 3654 2615 2940 ... 14 36770 3528 814 9627 2558 7653 00603 Aguadilla Puerto Rico
3 ZCTA5 00606 14980 2954 6901 862 (X) (X) 150 316 237 ... 0 5007 174 549 1385 201 482 00606 Maricao Puerto Rico
4 ZCTA5 00610 20167 3137 10643 832 (X) (X) 1787 1942 1910 ... 1 21725 1868 733 5966 2166 3292 00610 Anasco Puerto Rico

5 rows × 43 columns

4. Identification of columns with null/inconsistent values:¶

In [20]:
df_merged.columns
Out[20]:
Index(['NAME', 'Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
       'Poverty', 'ChildPoverty', 'Professional', 'Service', 'Office',
       'Construction', 'Production', 'Drive', 'Carpool', 'Transit', 'Walk',
       'OtherTransp', 'WorkAtHome', 'MeanCommute', 'Employed', 'PrivateWork',
       'PublicWork', 'SelfEmployed', 'FamilyWork', 'Unemployment', 'TotalPop',
       'Men', 'Women', 'Hispanic', 'White', 'Black', 'Native', 'Asian',
       'Pacific', 'Citizen', 'Housing with Mortgage',
       'Median Monthly housing costs', 'High School Graduate',
       'Some College or Associates degree', 'Bachelors degree or higher',
       'ZipCode', 'City', 'State'],
      dtype='object')
In [21]:
# To drop the NAME, Poverty and ChildPoverty columns as we don't have data in them
df_merged.drop(['NAME', 'Poverty', 'ChildPoverty'], axis=1, inplace=True)
In [22]:
# To set the index of the data frame.
df_merged.set_index(['ZipCode'], inplace=True)

5. Identification of rows with null/inconsistent values:¶

In [23]:
# To check number of rows and columns
df_merged.shape
Out[23]:
(32921, 39)
In [24]:
# To filter out the rows that have non-numeric values in the Income column
df_final = df_merged[df_merged['Income'].apply(lambda x: x.isnumeric())]
In [25]:
# To check the number of rows after dropping those with non-numeric Income values
df_final.shape
Out[25]:
(30474, 39)
In [26]:
# To verify the name of columns
df_final.columns
Out[26]:
Index(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
       'Professional', 'Service', 'Office', 'Construction', 'Production',
       'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
       'MeanCommute', 'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed',
       'FamilyWork', 'Unemployment', 'TotalPop', 'Men', 'Women', 'Hispanic',
       'White', 'Black', 'Native', 'Asian', 'Pacific', 'Citizen',
       'Housing with Mortgage', 'Median Monthly housing costs',
       'High School Graduate', 'Some College or Associates degree',
       'Bachelors degree or higher', 'City', 'State'],
      dtype='object')
In [27]:
# MeanCommute is not used in the later analysis, so drop it
del df_final['MeanCommute']

The counts in the Asian, Pacific and Native columns are very small, so we add them together into a new column 'RaceOther' and then drop the original three columns to reduce the number of columns to analyze.

In [28]:
convert_dict = {'Native': int,'Pacific': int,'Asian': int}
 
df_final = df_final.astype(convert_dict)
In [29]:
# To add the values of the Asian, Pacific and Native columns into a new column 'RaceOther'
df_final['RaceOther']= df_final['Asian'] + df_final['Pacific']+ df_final['Native']
df_final=df_final.drop(['Asian', 'Pacific','Native'], axis=1)
In [30]:
df_final['RaceOther']
Out[30]:
ZipCode
00601     104
00602    4757
00603     388
00606      22
00610    2936
         ... 
99921     572
99922     328
99925     505
99926    1410
99929     847
Name: RaceOther, Length: 30474, dtype: int32
In [31]:
# To verify the updated column name
df_final.columns
Out[31]:
Index(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
       'Professional', 'Service', 'Office', 'Construction', 'Production',
       'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
       'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork',
       'Unemployment', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White',
       'Black', 'Citizen', 'Housing with Mortgage',
       'Median Monthly housing costs', 'High School Graduate',
       'Some College or Associates degree', 'Bachelors degree or higher',
       'City', 'State', 'RaceOther'],
      dtype='object')

6. Data type conversion for numeric columns:¶

This is first done for the Income column only; the conversion is then extended to the other numeric columns below.
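As an aside, a more compact alternative we did not use here is pandas' `pd.to_numeric` with `errors='coerce'`, which turns inconsistent values such as '*' or '-' into NaN so they can be dropped in one pass (a sketch on made-up values):

```python
import pandas as pd

# Hypothetical Income column with the kinds of inconsistent values we observed
s = pd.Series(['14398', '-', '16771', '*'])

numeric = pd.to_numeric(s, errors='coerce')  # non-numeric entries become NaN
cleaned = numeric.dropna().astype(int)       # drop NaN rows, then cast to int
```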

In [32]:
# To change the data type of Income
df_final.Income = df_final.Income.astype(int)
In [33]:
# The column at position 29 is 'Median Monthly housing costs'; collect the rows where it is non-numeric
to_be_dropped = df_final[~df_final.iloc[:, 29].str.isnumeric()].index
In [34]:
df_final.drop(to_be_dropped, inplace=True)
In [35]:
convert_dict = {'Income':int,
       'Professional':int, 'Service':int, 'Office':int, 'Construction':int, 'Production':int,
       'Drive':int, 'Carpool':int, 'Transit':int, 'Walk':int, 'OtherTransp':int, 'WorkAtHome':int,
       'Employed':int, 'PrivateWork':int, 'PublicWork':int, 'SelfEmployed':int, 'FamilyWork':int ,'Unemployment':int, 'TotalPop':int, 'Men':int, 'Women':int, 'Hispanic':int, 'White':int,
       'Black':int, 'Citizen':int, 'Housing with Mortgage':int,
       'Median Monthly housing costs':int, 'High School Graduate':int,
       'Some College or Associates degree':int, 'Bachelors degree or higher':int
                }
df_final = df_final.astype(convert_dict)

7. Transformation of absolute values to percentages¶

In [36]:
# To convert education-related columns (positions 30-32) to percentages of total population
df_final.iloc[:, 30:33] = df_final.iloc[:, 30:33].div(df_final['TotalPop'], axis=0) * 100

8. Merging Sub-factors to create Factors using Weighted averages (which will be used in CNI later)¶

Our last step in managing our data is to combine the multiple sub-factors of a category into a single factor using weighted calculations. Each state's data will then be compared with the national average to determine whether the factors it lacks in need to be improved.
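A toy sketch (hypothetical values) of the comparison against the national average: each state's value is expressed as a percentage deviation from the national figure, so a negative deviation flags a factor with room for improvement.

```python
import pandas as pd

# Hypothetical values of one factor for two states plus the national average
factor = pd.Series({'Alabama': 34.2, 'Colorado': 41.5, 'National': 38.0})

# Percentage deviation from the national average (0 for 'National' itself)
deviation = (factor / factor['National']) * 100 - 100
```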

In [37]:
# To verify the updated column name
df_final.columns
Out[37]:
Index(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',
       'Professional', 'Service', 'Office', 'Construction', 'Production',
       'Drive', 'Carpool', 'Transit', 'Walk', 'OtherTransp', 'WorkAtHome',
       'Employed', 'PrivateWork', 'PublicWork', 'SelfEmployed', 'FamilyWork',
       'Unemployment', 'TotalPop', 'Men', 'Women', 'Hispanic', 'White',
       'Black', 'Citizen', 'Housing with Mortgage',
       'Median Monthly housing costs', 'High School Graduate',
       'Some College or Associates degree', 'Bachelors degree or higher',
       'City', 'State', 'RaceOther'],
      dtype='object')
In [38]:
#To calculate the CNI we need to compare the population of a state with the national average for different factors and see how much potential improvement is possible. 
#In order to ensure we are accurate with our analysis, we need to treat the factors as they are featured in our day to day lives and with an emphasis on what factors can be improved
#We start by assigning weights to a dataframe for us to use in our calculations later

data_5_subfactors = [.2, .2, .2, .2, .2]   # five equal weights summing to 1
weighted_averages_5_subfactors = pd.DataFrame(data_5_subfactors, columns=['Weights'])
data_3_subfactors = [.33, .33, .33]        # three (approximately) equal weights
weighted_averages_3_subfactors = pd.DataFrame(data_3_subfactors, columns=['Weights'])
In [39]:
# Create new pandas DataFrame.
cni_df = pd.DataFrame(df_final[['Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','Unemployment','TotalPop','Median Monthly housing costs','High School Graduate','Some College or Associates degree','Bachelors degree or higher','City','State']])
# Appending a new row to hold the national average (pd.concat, since DataFrame.append is deprecated)
new_row = {'State': 'National'}
cni_df = pd.concat([cni_df, pd.DataFrame([new_row], index=['00000'], columns=cni_df.columns)])
In [40]:
cni_df.sort_index().head(10)
cni_df['Median Monthly housing costs'] = cni_df['Median Monthly housing costs'].replace('-', '0')
cni_df['Median Monthly housing costs'] = cni_df['Median Monthly housing costs'].replace('4,000+', '4000')
cni_df['Median Monthly housing costs'] = cni_df['Median Monthly housing costs'].replace('', '0')
cni_df['Median Monthly housing costs'] = cni_df['Median Monthly housing costs'].astype(float)
cni_df = cni_df.replace(np.nan, 0)
In [41]:
#Calculating the aggregate functions for each state 
#cni_df_mean = cni_df.groupby(['State']).mean()
cni_df_sum = cni_df.groupby(['State']).sum()
In [42]:
cni_df_sum.loc['National'] = cni_df_sum.mean()
household_costs = cni_df_sum.at['National','Median Monthly housing costs']

cni_df_sum['Median Monthly housing costs'] = ((cni_df_sum['Median Monthly housing costs'] / household_costs) *100) - 100
In [43]:
# Loop to convert each factor column to a percentage of total population

for column in cni_df_sum[['Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','Unemployment','High School Graduate','Some College or Associates degree','Bachelors degree or higher']]:
    # Select column contents by column  
    cni_df_sum[column] = ((cni_df_sum[column] / (cni_df_sum['TotalPop'])) *100)
    #print(cni_df_sum.at['National',column])
    
cni_df_sum
Out[43]:
Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp ... PrivateWork PublicWork SelfEmployed FamilyWork Unemployment TotalPop Median Monthly housing costs High School Graduate Some College or Associates degree Bachelors degree or higher
State
Alabama 15.517892 6.996747 9.361843 2.905560 7.460299 36.431098 3.587654 0.142825 0.438670 0.390554 ... 34.217313 6.819153 2.352346 0.068796 2.554863 4.854899e+06 -18.165793 0.282468 0.072049 0.159863
Alaska 18.164113 8.300886 9.742261 3.248667 5.729972 32.782396 5.712424 0.591883 3.078941 2.340202 ... 32.065664 11.956176 3.210694 0.117657 3.615016 6.952390e+05 -74.700786 0.531333 0.089519 0.260296
Arizona 16.893299 8.384180 10.764726 3.186526 4.915529 33.121350 4.746640 0.708042 0.775876 1.102445 ... 35.938092 6.170506 2.762978 0.082710 2.772566 7.092238e+06 -41.171357 0.088568 0.029628 0.089994
Arkansas 15.233561 7.228191 9.258655 3.012652 7.508308 35.491897 4.402944 0.131575 0.685975 0.491731 ... 34.280291 6.625138 2.609711 0.088890 2.379219 2.982323e+06 -38.990689 0.447795 0.081341 0.187184
California 18.052831 8.866379 10.003480 3.129476 5.886095 33.723382 4.729881 1.953256 0.996406 1.054699 ... 36.717911 6.733633 3.636072 0.089660 3.169723 3.673532e+07 310.408103 0.059096 0.023061 0.083370
Colorado 22.574445 8.550391 10.853854 4.182373 5.205295 37.520998 4.391058 1.453044 1.321423 1.107471 ... 41.502135 7.008471 3.243752 0.123338 2.510374 5.638163e+06 -6.453467 0.138559 0.051702 0.198719
Connecticut 22.170371 8.985366 10.587594 3.173954 5.378130 38.758584 4.008446 1.830711 1.294363 0.639119 ... 41.023023 6.665408 3.116804 0.086822 3.301980 3.373552e+06 -29.940409 0.145793 0.046034 0.214605
Delaware 19.365242 8.101976 10.224503 3.267541 5.522569 37.000775 3.975466 1.029805 0.939908 0.532402 ... 38.011725 7.202903 2.049192 0.090522 2.932537 9.599880e+05 -88.392837 0.132495 0.037060 0.134530
District of Columbia 35.784307 7.463003 7.985717 1.566373 2.069634 17.397071 2.641422 17.303050 6.675175 3.687144 ... 38.768211 13.599548 2.336075 0.076042 4.248615 6.785740e+05 -92.594017 0.028614 0.005780 0.149399
Florida 16.586525 8.972596 11.123186 3.603624 4.791410 35.094141 4.149938 0.724587 0.602812 1.037378 ... 37.601212 5.339769 2.702514 0.087979 2.625330 2.107323e+07 83.591540 0.094155 0.030548 0.091858
Georgia 18.040724 7.494084 10.323714 3.105414 6.883665 35.927394 4.293264 0.889119 0.616355 0.829424 ... 37.547414 6.674288 2.535472 0.088588 2.775890 1.042808e+07 9.697993 0.144374 0.035129 0.099810
Hawaii 17.037514 10.949130 10.900893 3.466905 4.426400 33.225414 6.601400 2.699580 2.107894 1.494572 ... 34.601522 9.696170 3.277787 0.105341 2.273674 1.409701e+06 -76.966288 0.113378 0.042759 0.123871
Idaho 16.858347 7.894713 10.170935 3.821008 6.130272 35.884644 4.305448 0.292759 1.113590 0.903777 ... 36.414500 6.804723 3.396893 0.113224 2.041543 1.737264e+06 -63.418886 0.273096 0.089068 0.203593
Illinois 19.489175 8.265461 10.550409 2.670306 7.264796 34.573609 3.762452 4.236620 1.336477 0.870151 ... 40.813050 5.995118 2.244781 0.081697 3.120640 1.261241e+07 126.048495 0.245124 0.074099 0.174291
Indiana 16.895616 7.740172 9.917196 2.982455 9.407204 38.444218 4.254182 0.415381 0.926558 0.615683 ... 40.660537 5.217092 2.178391 0.076262 2.382582 6.659917e+06 0.314501 0.288240 0.068709 0.146369
Iowa 19.032455 8.108007 10.444292 3.357644 8.723400 40.429741 4.053605 0.448228 1.526999 0.659090 ... 40.958857 6.875755 3.269006 0.112591 2.075890 3.135234e+06 34.759644 0.731703 0.260454 0.407015
Kansas 19.587903 8.112023 10.204043 3.200810 7.264514 39.946177 4.403689 0.213104 1.034262 0.642028 ... 38.913726 7.839359 3.032888 0.114489 2.092403 2.872773e+06 -3.531575 0.476105 0.149642 0.370616
Kentucky 15.875338 7.237387 9.545596 2.729507 8.285739 35.841981 4.031094 0.413792 0.859868 0.574805 ... 36.155268 6.410191 2.338713 0.105730 2.539380 4.382880e+06 -18.023751 0.363542 0.080603 0.157939
Louisiana 15.409920 8.167556 9.453325 3.507870 5.694920 35.204949 3.886365 0.472184 0.764546 0.809719 ... 34.154816 6.615920 2.704819 0.077293 3.054733 4.604560e+06 -33.921620 0.248471 0.040908 0.118844
Maine 19.580573 8.775628 10.601421 3.699747 6.062370 38.218018 4.576935 0.267515 1.785283 0.639980 ... 39.336539 6.786071 4.249180 0.086046 2.116638 1.333010e+06 -35.127933 0.768307 0.223199 0.632789
Maryland 23.830384 8.591022 10.066531 3.694789 4.665454 36.673142 4.388598 3.748723 1.016108 0.804654 ... 37.485173 11.084008 2.555701 0.079659 2.815132 5.842390e+06 5.969199 0.143354 0.035596 0.175234
Massachusetts 25.029923 8.962845 10.274134 3.101142 4.957742 35.395434 3.785999 4.926456 2.359902 1.093837 ... 43.200730 6.552162 2.972921 0.082096 2.830026 6.658031e+06 40.586487 0.124896 0.043755 0.234664
Michigan 17.584876 8.061888 9.650896 2.578421 7.748206 36.949176 3.943358 0.580674 0.978408 0.574684 ... 39.363872 4.967974 2.317338 0.076014 2.986536 9.950858e+06 50.505568 0.229525 0.066974 0.162799
Minnesota 22.305615 8.316324 10.762776 3.227462 7.194304 39.406791 4.216674 1.669803 1.326165 0.791574 ... 43.486367 6.417065 2.826572 0.086691 2.089962 5.594551e+06 52.105558 0.341812 0.142519 0.263010
Mississippi 13.891386 7.423872 8.713395 2.803837 7.487517 35.093756 3.826217 0.110477 0.452029 0.638071 ... 31.722896 7.728780 2.288741 0.117371 3.204380 2.944501e+06 -50.171857 0.276613 0.085255 0.146752
Missouri 18.190505 8.063147 10.518584 3.139059 7.038833 38.331103 3.958306 0.549856 0.804997 0.562050 ... 39.361456 5.887171 2.663779 0.094980 2.264538 6.077044e+06 31.858342 0.405351 0.082820 0.208100
Montana 18.407747 9.221034 10.308342 4.045053 5.439803 36.501247 4.467341 0.350868 2.177973 1.067944 ... 36.780408 8.015906 4.162393 0.146195 2.109333 1.043126e+06 -50.787067 0.593173 0.183599 0.507790
National 18.585241 8.349249 10.202391 3.210139 6.304858 35.350445 4.172649 2.069473 1.122900 0.835093 ... 38.057873 6.686434 2.818555 0.087964 2.752580 6.091775e+06 0.000000 0.205704 0.059110 0.159334
Nebraska 20.212290 8.278421 11.034195 3.637961 7.392674 41.508335 4.619832 0.325807 1.262182 0.607112 ... 41.379541 7.327492 3.224092 0.111464 1.823797 1.910026e+06 -18.632576 0.608735 0.250951 0.449183
Nevada 14.526139 11.827021 10.807067 3.402596 5.658025 35.582648 4.894715 1.359244 0.740563 1.046410 ... 38.791610 5.606405 2.564620 0.082182 3.312004 3.020135e+06 -73.471929 0.095050 0.027278 0.079621
New Hampshire 22.601028 8.620209 11.509625 3.694819 6.471457 42.003871 4.044458 0.417261 1.338915 0.633947 ... 43.149527 7.081976 3.574282 0.087354 2.102636 1.353111e+06 -43.567824 0.364111 0.136798 0.481784
New Jersey 21.805943 7.953427 10.840442 3.001327 5.830078 34.102104 3.866212 5.211197 1.295085 0.945817 ... 40.754602 6.874836 2.284959 0.069129 3.084758 8.607158e+06 70.532684 0.117520 0.030934 0.176998
New Mexico 16.235249 8.368484 9.316456 3.098929 4.249889 33.625294 4.260363 0.425353 0.779945 0.678647 ... 30.340390 9.580912 2.748369 0.089496 2.980223 2.033606e+06 -69.039963 0.209063 0.060187 0.167816
New York 19.962686 9.726890 10.162657 2.818871 4.913470 25.160427 3.100604 12.129205 2.632111 0.993303 ... 37.713690 7.800679 2.693477 0.071108 2.960639 1.851681e+07 250.664789 0.179984 0.065100 0.182754
North Carolina 18.224936 7.834720 9.786308 3.274216 6.717170 36.939190 4.109554 0.446884 0.680022 0.613556 ... 37.612060 6.490079 2.597726 0.086597 2.687002 1.030290e+07 21.199993 0.154936 0.052326 0.126957
North Dakota 20.469291 9.051543 10.689266 3.992895 6.843893 42.979912 4.681194 0.278289 1.605304 0.708186 ... 40.870746 8.491968 3.866078 0.157057 1.682538 7.341290e+05 -51.037009 0.915645 0.426466 0.678533
Ohio 18.098172 8.075664 10.154925 2.686925 8.131203 38.469280 3.583920 0.676802 0.991343 0.565742 ... 39.762701 5.871590 2.326092 0.075528 2.679132 1.162615e+07 71.469638 0.258710 0.061051 0.139719
Oklahoma 16.082777 7.796205 9.892935 3.234548 6.461184 36.595291 4.350927 0.162698 0.738358 0.616238 ... 34.502210 7.646599 2.958280 0.107904 2.423582 3.918291e+06 -16.801149 0.389395 0.076473 0.205075
Oregon 19.506370 8.483489 10.004232 3.128172 5.982575 33.413197 4.448837 1.898612 1.673337 1.480632 ... 38.106307 6.652683 3.423823 0.097397 2.781505 4.163358e+06 -25.800997 0.176386 0.060836 0.169809
Pennsylvania 19.395753 8.250551 10.120244 2.899997 6.907047 35.495893 3.907322 2.473363 1.628873 0.746123 ... 40.871552 5.192466 2.441574 0.075718 2.748137 1.270246e+07 173.502064 0.373152 0.083435 0.211696
Puerto Rico 10.369517 6.501285 8.184341 1.763543 3.614495 25.630334 2.360733 0.463876 0.811053 0.533410 ... 21.740750 6.610693 3.153418 0.048212 5.603530 3.250219e+06 -86.179069 0.077504 0.028081 0.070680
Rhode Island 20.519943 9.340266 10.967426 3.082906 6.041975 39.545489 4.007028 1.114815 1.476163 0.645869 ... 41.551377 6.551567 2.476068 0.081007 2.965907 1.053000e+06 -82.181309 0.126942 0.043169 0.171293
South Carolina 16.313769 7.919001 9.993936 3.143556 7.160914 37.114389 4.057505 0.230994 0.586460 0.646184 ... 36.136964 6.834258 2.495605 0.097330 2.659608 5.036494e+06 -43.186005 0.173039 0.050290 0.114612
South Dakota 19.219038 8.610229 10.989418 3.689441 7.132849 40.722084 4.106377 0.232588 1.716101 0.586305 ... 39.594528 7.759845 4.027118 0.158990 1.764695 8.478520e+05 -54.679404 0.828174 0.288678 0.531700
Tennessee 16.885217 7.681539 10.142055 3.080457 7.778924 37.629158 4.017868 0.274707 0.530062 0.545297 ... 37.122462 6.179316 3.156276 0.088943 2.609092 6.727897e+06 -12.650009 0.230936 0.042470 0.125883
Texas 17.651444 8.024465 10.358714 4.099250 6.069652 36.622279 4.634543 0.585768 0.645721 0.728613 ... 37.547453 6.270239 3.193893 0.098382 2.641412 2.816679e+07 202.270095 0.111904 0.027794 0.091578
Utah 19.401307 7.297096 11.578655 3.537830 6.296299 35.862658 5.017803 1.072669 1.114570 0.729783 ... 39.322347 7.095843 2.361080 0.091575 1.821483 3.126408e+06 -54.987856 0.132280 0.051307 0.133141
Vermont 22.810934 8.795629 10.101763 3.793831 5.764614 38.657872 4.414257 0.609410 2.532730 0.881069 ... 40.641452 7.690850 4.363877 0.113880 2.058258 6.173180e+05 -51.324871 0.902031 0.236649 1.048757
Virginia 22.288205 8.153862 9.788953 3.260821 5.219895 37.370689 4.451882 1.979120 1.028529 0.866081 ... 37.101869 9.865304 2.383311 0.084018 2.400365 8.320818e+06 46.977457 0.199920 0.051990 0.182205
Washington 20.768858 8.088880 9.525391 3.418154 5.869852 33.722919 4.692491 2.890937 1.576568 0.979126 ... 38.313013 7.670986 2.816186 0.088425 2.525203 7.379130e+06 16.752389 0.129915 0.053226 0.152064
West Virginia 14.423560 7.868152 9.042487 2.694171 5.934481 33.570453 3.542311 0.352015 1.141360 0.488274 ... 31.764317 8.149857 1.763450 0.056353 2.847876 1.719527e+06 -44.836427 0.832852 0.127327 0.281314
Wisconsin 19.148602 8.302859 10.415931 3.033333 9.154616 40.493168 3.889678 0.771039 1.453320 0.748353 ... 42.389704 6.294605 2.618012 0.104956 1.889446 5.796726e+06 31.932751 0.325192 0.107715 0.221534
Wyoming 17.690173 8.550883 9.357948 4.051109 6.342138 38.111351 4.807987 0.494241 1.616081 0.857429 ... 35.678096 10.291278 3.211059 0.191348 2.301315 5.638950e+05 -80.412562 0.418795 0.142173 0.324341

53 rows × 22 columns

In [44]:
# Express each factor column as a percent deviation from its national value

for column in cni_df_sum[['Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','Unemployment','High School Graduate','Some College or Associates degree','Bachelors degree or higher']]:
    # ((state value / national value) * 100) - 100; the 'National' row itself becomes 0
    cni_df_sum[column] = ((cni_df_sum[column] / cni_df_sum.at['National', column]) * 100) - 100
    
cni_df_sum.head()
Out[44]:
Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp ... PrivateWork PublicWork SelfEmployed FamilyWork Unemployment TotalPop Median Monthly housing costs High School Graduate Some College or Associates degree Bachelors degree or higher
State
Alabama -16.504219 -16.199088 -8.238740 -9.488029 18.326204 3.056970 -14.019750 -93.098495 -60.934176 -53.232258 ... -10.091368 1.984903 -16.540740 -21.790411 -7.182996 4854899.0 -18.165793 37.317488 21.891077 0.332283
Alaska -2.265927 -0.579247 -4.510018 1.200202 -9.118147 -7.264547 36.901604 -71.399352 174.195393 180.232684 ... -15.744993 78.812447 13.912764 33.755889 31.331892 695239.0 -74.700786 158.299280 51.445707 63.365159
Arizona -9.103686 0.418370 5.511798 -0.735574 -22.035855 -6.305705 13.756019 -65.786384 -30.904257 32.014661 ... -5.569889 -7.716035 -1.971831 -5.973019 0.726080 7092238.0 -41.171357 -56.944196 -49.875740 -43.518504
Arkansas -18.034095 -13.427055 -9.250145 -6.151986 19.087660 0.400141 5.519135 -93.642089 -38.910404 -41.116617 ... -9.925889 -0.916729 -7.409642 1.052901 -13.564044 2982323.0 -38.990689 117.688837 37.610423 17.478773
California -2.864696 6.193723 -1.949648 -2.512768 -6.641919 -4.602667 13.354371 -5.615787 -11.264967 26.297216 ... -3.520854 0.705894 29.004818 1.928090 15.154599 36735325.0 310.408103 -71.271531 -60.986107 -47.675711

5 rows × 22 columns
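The loop above rescales each column against the 'National' benchmark row. The transformation can be sketched on toy data (the state and column names here are illustrative, not from the dataset):

```python
import pandas as pd

# Toy frame: one factor column, with a 'National' row as the benchmark
df = pd.DataFrame({"Professional": [20.0, 10.0, 16.0]},
                  index=["StateA", "StateB", "National"])

# ((state value / national value) * 100) - 100  → percent deviation from national
df["Professional"] = (df["Professional"] / df.at["National", "Professional"]) * 100 - 100

# StateA is 25% above the national figure, StateB 37.5% below, National itself is 0
print(df)
```

Because the right-hand side is evaluated before assignment, reading the 'National' cell and overwriting the same column in one statement is safe.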

In [45]:
final_cni = pd.DataFrame(cni_df_sum.copy())
final_cni = final_cni.drop(['Unemployment','TotalPop','Professional','Service','Office','Construction','Production','Drive','Carpool','Transit','Walk','OtherTransp','WorkAtHome','Employed','PrivateWork','PublicWork','SelfEmployed','FamilyWork','High School Graduate','Some College or Associates degree','Bachelors degree or higher'], axis=1)

final_cni
Out[45]:
Median Monthly housing costs
State
Alabama -18.165793
Alaska -74.700786
Arizona -41.171357
Arkansas -38.990689
California 310.408103
Colorado -6.453467
Connecticut -29.940409
Delaware -88.392837
District of Columbia -92.594017
Florida 83.591540
Georgia 9.697993
Hawaii -76.966288
Idaho -63.418886
Illinois 126.048495
Indiana 0.314501
Iowa 34.759644
Kansas -3.531575
Kentucky -18.023751
Louisiana -33.921620
Maine -35.127933
Maryland 5.969199
Massachusetts 40.586487
Michigan 50.505568
Minnesota 52.105558
Mississippi -50.171857
Missouri 31.858342
Montana -50.787067
National 0.000000
Nebraska -18.632576
Nevada -73.471929
New Hampshire -43.567824
New Jersey 70.532684
New Mexico -69.039963
New York 250.664789
North Carolina 21.199993
North Dakota -51.037009
Ohio 71.469638
Oklahoma -16.801149
Oregon -25.800997
Pennsylvania 173.502064
Puerto Rico -86.179069
Rhode Island -82.181309
South Carolina -43.186005
South Dakota -54.679404
Tennessee -12.650009
Texas 202.270095
Utah -54.987856
Vermont -51.324871
Virginia 46.977457
Washington 16.752389
West Virginia -44.836427
Wisconsin 31.932751
Wyoming -80.412562
In [46]:
# Weighted sub-indices: multiply each sub-factor by its weight and sum

w5 = weighted_averages_5_subfactors["Weights"]
w3 = weighted_averages_3_subfactors["Weights"]

final_cni['Type of Work'] = (cni_df_sum['Professional'] * w5.iloc[0] + cni_df_sum['Service'] * w5.iloc[1] + cni_df_sum['Office'] * w5.iloc[2] + cni_df_sum['Construction'] * w5.iloc[3] + cni_df_sum['Production'] * w5.iloc[4])
final_cni['Mode of Transport'] = (cni_df_sum['Drive'] * w5.iloc[0] + cni_df_sum['Carpool'] * w5.iloc[1] + cni_df_sum['Transit'] * w5.iloc[2] + cni_df_sum['Walk'] * w5.iloc[3] + cni_df_sum['OtherTransp'] * w5.iloc[4])
final_cni['Employment Category'] = (cni_df_sum['WorkAtHome'] * w5.iloc[0] + cni_df_sum['PrivateWork'] * w5.iloc[1] + cni_df_sum['PublicWork'] * w5.iloc[2] + cni_df_sum['SelfEmployed'] * w5.iloc[3] + cni_df_sum['FamilyWork'] * w5.iloc[4])
final_cni['Educational Attainment'] = (cni_df_sum['High School Graduate'] * w3.iloc[0] + cni_df_sum['Some College or Associates degree'] * w3.iloc[1] + cni_df_sum['Bachelors degree or higher'] * w3.iloc[2])
final_cni
Out[46]:
Median Monthly housing costs Type of Work Mode of Transport Employment Category Educational Attainment
State
Alabama -18.165793 -6.420775 -43.645542 -17.924669 19.648480
Alaska -74.700786 -3.054627 62.533157 19.692507 90.126348
Arizona -41.171357 -5.188989 -11.445133 -0.601713 -49.611685
Arkansas -38.990689 -5.555124 -33.549967 -12.125266 57.016751
California 310.408103 -1.555062 3.633633 7.297230 -59.378005
Colorado -6.453467 8.621056 6.376720 27.242655 -6.749935
Connecticut -29.940409 2.971766 -2.805861 4.603793 -5.463660
Delaware -88.392837 -1.833516 -20.567638 -3.219296 -29.191989
District of Columbia -92.594017 -11.635916 296.921739 34.580900 -60.240443
Florida 83.591540 -1.202009 -17.669933 -4.184747 -47.815837
Georgia 9.697993 -1.213019 -19.660602 -0.732379 -35.555028
Hawaii -76.966288 1.572523 49.866456 11.657827 -31.284312
Idaho -63.418886 0.243237 -14.752802 10.653743 36.703288
Illinois 126.048495 1.136037 23.181959 -5.423844 17.789840
Indiana 0.314501 4.586277 -22.596285 -14.760262 15.915022
Iowa 34.759644 8.968575 -10.382867 10.230992 248.088518
Kansas -3.531575 3.500025 -20.435533 10.243131 137.680629
Kentucky -18.023751 -3.577902 -27.320012 -7.007586 37.031706
Louisiana -33.921620 -5.400558 -23.881518 -14.558129 -11.686857
Maine -35.127933 5.155863 -6.729627 14.668224 279.922128
Maryland 5.969199 3.776286 15.381083 13.039190 -19.836673
Massachusetts 40.586487 3.593387 54.011956 7.319763 -5.934301
Michigan 50.505568 -2.203218 -23.392908 -14.332338 8.929581
Minnesota 52.105558 7.952634 1.221527 8.493854 89.874171
Mississippi -50.171857 -8.966541 -37.405498 -9.807020 23.366263
Missouri 31.858342 1.395160 -26.228471 -2.821347 55.365482
Montana -50.787067 4.562630 9.823041 28.973448 203.829237
National 0.000000 0.000000 0.000000 0.000000 0.000000
Nebraska -18.632576 9.328014 -14.203254 10.344619 231.789416
Nevada -73.471929 4.295213 -5.020433 -9.946994 -52.032452
New Hampshire -43.567824 11.081308 -13.787547 15.407495 135.567691
New Jersey 70.532684 0.961468 33.906069 -5.216583 -26.218678
New Mexico -69.039963 -11.431078 -26.300056 0.590440 2.897117
New York 250.664789 -2.146961 116.986238 -3.010155 4.068501
North Carolina 21.199993 -0.728773 -28.278544 -1.220524 -18.637267
North Dakota -51.037009 11.250868 -5.003790 26.368925 426.514081
Ohio 71.469638 1.261147 -23.310493 -10.725887 5.525082
Oklahoma -16.801149 -3.976402 -28.959381 -0.046179 48.635748
Oregon -25.800997 -0.608657 23.840648 13.139337 -1.570167
Pennsylvania 173.502064 0.452725 9.594976 -8.066395 51.287830
Puerto Rico -86.179069 -34.770746 -42.480419 -29.781819 -56.250793
Rhode Island -82.181309 4.329001 -5.886371 -6.406690 -19.057829
South Carolina -43.186005 -1.582936 -31.400353 -5.427653 -19.426873
South Dakota -54.679404 8.462723 -10.423583 28.973767 305.145643
Tennessee -12.650009 0.320855 -34.297338 -2.135497 -12.169837
Texas 202.270095 3.316845 -22.454659 2.966915 -46.564196
Utah -54.987856 3.070208 -7.963179 5.845127 -21.559777
Vermont -51.324871 7.342181 15.130419 29.344979 395.036111
Virginia 46.977457 -0.419572 0.669468 8.924204 -0.165590
Washington 16.752389 0.315083 21.039538 9.262179 -16.949007
West Virginia -44.836427 -12.294210 -28.603684 -22.259071 163.958084
Wisconsin 31.932751 8.852062 -7.187440 3.254867 59.187157
Wyoming -80.412562 3.222131 -1.297245 33.186922 114.732941
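The weighted sub-index construction above (multiply each sub-factor's deviation by its weight, then sum) can be sketched on toy data; the column names and weights below are illustrative, not the project's actual weights:

```python
import pandas as pd

# Toy percent deviations for three sub-factors, and hypothetical weights summing to 1
dev = pd.DataFrame({"A": [10.0, -20.0], "B": [0.0, 5.0], "C": [-10.0, 15.0]},
                   index=["StateA", "StateB"])
weights = pd.Series({"A": 0.5, "B": 0.3, "C": 0.2})

# Weighted sum across sub-factors yields one composite sub-index per state
sub_index = dev.mul(weights, axis=1).sum(axis=1)
print(sub_index)  # StateA: 10*0.5 + 0*0.3 - 10*0.2 = 3.0; StateB: -10 + 1.5 + 3 = -5.5
```

`mul(..., axis=1)` aligns the weights with the columns, so the weighted sum stays correct even if the column order changes.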

Data Analysis¶

This section attempts to answer the questions of interest that will assist in building the needs index. We will perform exploratory data analysis to study the relations between education, type of work, and income.

Q1 What is the median household income in the United States? How do median income and the type of employment vary across different states in the US?¶

The public sector, also known as the state sector, is a sector of the economy that includes both public services and public enterprises. It excludes private businesses, non-profit organizations, and households. Public sectors include the military, law enforcement, infrastructure, public transportation, public education, and health care, as well as those working for the government itself, such as elected officials.

The private sector is the segment of the economy owned, managed, and controlled by profit-seeking individuals and organizations. Companies in the private sector are typically not owned or controlled by the state.

The private sector is typically the largest, employing roughly ten times as many people as the public sector.

The self-employed proportion of the population is typically covered by the private sector, but to distinguish freelancers from organizationally employed workers, we treat the self-employed count as a separate entity.

It is widely assumed that public jobs raise the median household income of the entire state. We've come to find out if that's correct.

In [47]:
print('Median household income in the US is: $', round(df_final['Income'].mean(), 2))
Median household income in the US is: $ 62995.77
In [48]:
income_state= df_final.groupby(['State'],as_index=False)['Income'].mean().sort_values("Income", ascending = False).reset_index()
del income_state['index']
income_state.index = np.arange(1, len(income_state) + 1)
income_state.head(5)
Out[48]:
State Income
1 New Jersey 96061.160305
2 District of Columbia 95414.714286
3 Maryland 92802.761194
4 Massachusetts 90983.526860
5 Connecticut 89040.648221
In [49]:
# Creating one dictionary to replace the State name with the State code
us_state= {
    "Alabama": "AL",
    "Alaska": "AK",
    "Arizona": "AZ",
    "Arkansas": "AR",
    "California": "CA",
    "Colorado": "CO",
    "Connecticut": "CT",
    "Delaware": "DE",
    "Florida": "FL",
    "Georgia": "GA",
    "Hawaii": "HI",
    "Idaho": "ID",
    "Illinois": "IL",
    "Indiana": "IN",
    "Iowa": "IA",
    "Kansas": "KS",
    "Kentucky": "KY",
    "Louisiana": "LA",
    "Maine": "ME",
    "Maryland": "MD",
    "Massachusetts": "MA",
    "Michigan": "MI",
    "Minnesota": "MN",
    "Mississippi": "MS",
    "Missouri": "MO",
    "Montana": "MT",
    "Nebraska": "NE",
    "Nevada": "NV",
    "New Hampshire": "NH",
    "New Jersey": "NJ",
    "New Mexico": "NM",
    "New York": "NY",
    "North Carolina": "NC",
    "North Dakota": "ND",
    "Ohio": "OH",
    "Oklahoma": "OK",
    "Oregon": "OR",
    "Pennsylvania": "PA",
    "Rhode Island": "RI",
    "South Carolina": "SC",
    "South Dakota": "SD",
    "Tennessee": "TN",
    "Texas": "TX",
    "Utah": "UT",
    "Vermont": "VT",
    "Virginia": "VA",
    "Washington": "WA",
    "West Virginia": "WV",
    "Wisconsin": "WI",
    "Wyoming": "WY",
    "District of Columbia": "DC",
    "American Samoa": "AS",
    "Guam": "GU",
    "Northern Mariana Islands": "MP",
    "Puerto Rico": "PR",
    "United States Minor Outlying Islands": "UM",
    "U.S. Virgin Islands": "VI",
}
In [50]:
# To replace the state with the state code
income_state['State'] = income_state['State'].replace(us_state)
In [51]:
# The State column now holds the two-letter codes; display them as strings to verify
income_state.State.astype(str)
Out[51]:
1     NJ
2     DC
3     MD
4     MA
5     CT
6     RI
7     NH
8     HI
9     CA
10    NY
11    DE
12    WA
13    CO
14    VA
15    UT
16    ND
17    MN
18    IL
19    WY
20    VT
21    NV
22    WI
23    PA
24    IA
25    NE
26    OR
27    SD
28    TX
29    AK
30    IN
31    FL
32    OH
33    MI
34    KS
35    ME
36    AZ
37    MT
38    ID
39    NC
40    GA
41    MO
42    TN
43    OK
44    SC
45    AL
46    LA
47    KY
48    WV
49    NM
50    AR
51    MS
52    PR
Name: State, dtype: object
In [52]:
df_work2 = df_final[['PrivateWork', 'PublicWork', 'SelfEmployed','FamilyWork', 'Income', 'Unemployment', 'State', 'TotalPop']].copy()
In [53]:
df_work2['TotalWorkPop']= df_work2['PrivateWork'] + df_work2['PublicWork']+ df_work2['SelfEmployed'] + df_work2['FamilyWork']
In [54]:
df_state= df_work2.groupby(['State'],as_index=False)['TotalPop'].sum().sort_values("TotalPop", ascending = False)
In [55]:
df_work= df_work2.groupby(['State'],as_index=False)[['PrivateWork', 'PublicWork', 'SelfEmployed','FamilyWork', 'Unemployment', 'TotalWorkPop']].sum().sort_values("PublicWork", ascending = False)
In [56]:
df_income= df_work2.groupby(['State'],as_index=False)['Income'].mean().sort_values("Income", ascending = False)
In [57]:
df_map = df_state.merge(df_work,on='State').merge(df_income,on='State')
In [58]:
# To convert work related columns to percentage values
df_map.iloc[:,2:7] = df_map.iloc[:,2:7].div(df_map['TotalWorkPop'],axis=0) 
df_map.iloc[:,2:7] =df_map.iloc[:,2:7].apply(lambda x: x*100)
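The conversion above divides each work-related column by the row's total working population (`div` with `axis=0`) and scales to percent. A minimal sketch with illustrative numbers:

```python
import pandas as pd

# Toy counts per state plus a row total
df = pd.DataFrame({"PrivateWork": [80, 60], "PublicWork": [20, 40],
                   "TotalWorkPop": [100, 100]}, index=["StateA", "StateB"])

# axis=0 broadcasts the TotalWorkPop series down the rows, giving shares, then percent
df[["PrivateWork", "PublicWork"]] = (
    df[["PrivateWork", "PublicWork"]].div(df["TotalWorkPop"], axis=0) * 100
)
print(df)  # PrivateWork and PublicWork now hold percentages of each row's total
```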
In [59]:
df_map
Out[59]:
State TotalPop PrivateWork PublicWork SelfEmployed FamilyWork Unemployment TotalWorkPop Income
0 California 36735325 77.829654 14.273043 7.707254 0.190050 6.718749 17330726 75406.715000
1 Texas 28166789 79.701719 13.309793 6.779654 0.208834 5.606907 13269365 61707.793103
2 Florida 21073233 82.221737 11.676355 5.909527 0.192382 5.740752 9637100 60075.754135
3 New York 18516812 78.116212 16.157514 5.578987 0.147286 6.132360 8939723 72053.825133
4 Pennsylvania 12702461 84.130198 10.688197 5.025748 0.155857 5.656778 6171022 63147.635317
5 Illinois 12612412 83.063690 12.201407 4.568631 0.166272 6.351201 6197064 65904.558552
6 Ohio 11626152 82.777031 12.223335 4.842402 0.157232 5.577353 5584728 60064.601277
7 Georgia 10428079 80.151145 14.247367 5.412382 0.189105 5.925595 4885113 54812.826484
8 North Carolina 10302895 80.390904 13.871703 5.552303 0.185090 5.743119 4820360 54940.244265
9 Michigan 9950858 84.245492 10.632322 4.959504 0.162682 6.391704 4649558 59328.206568
10 New Jersey 8607158 81.536070 13.754204 4.571424 0.138303 6.171550 4302161 96061.160305
11 Virginia 8320818 75.052579 19.956313 4.821150 0.169959 4.855647 4113355 69735.424324
12 Washington 7379130 78.367975 15.690742 5.760413 0.180870 5.165217 3607554 70725.940840
13 Arizona 7092238 79.943637 13.726180 6.146196 0.183987 6.167524 3188265 58111.956140
14 Tennessee 6727897 79.752647 13.275434 6.780837 0.191082 5.605285 3131634 52762.074913
15 Indiana 6659917 84.476645 10.839071 4.525841 0.158443 4.950071 3205570 60169.627809
16 Massachusetts 6658031 81.807309 12.407540 5.629689 0.155462 5.359095 3515967 90983.526860
17 Missouri 6077044 81.990416 12.263053 5.548685 0.197845 4.717063 2917430 54102.435897
18 Maryland 5842390 73.206735 21.646534 4.991160 0.155571 5.497817 2991569 92802.761194
19 Wisconsin 5796726 82.458567 12.244580 5.092688 0.204165 3.675444 2979939 63733.914324
20 Colorado 5638163 79.999959 13.509603 6.252690 0.237748 4.839025 2924949 70080.758542
21 Minnesota 5594551 82.334509 12.149691 5.351663 0.164137 3.957010 2954857 67039.682783
22 South Carolina 5036494 79.310068 14.999198 5.477123 0.213610 5.837062 2294836 50157.204360
23 Alabama 4854899 78.737221 15.691506 5.412966 0.158307 5.878977 2109823 49457.133690
24 Louisiana 4604560 78.421545 15.190556 6.210429 0.177469 7.013853 2005417 49176.983721
25 Kentucky 4382880 80.327364 14.241736 5.195997 0.234903 5.641826 1972730 48647.682968
26 Oregon 4163358 78.927385 13.779316 7.091566 0.201733 5.761169 2010078 62107.326923
27 Oklahoma 3918291 76.307012 16.911645 6.542696 0.238647 5.360129 1771655 51905.199662
28 Connecticut 3373552 80.607909 13.097148 6.124343 0.170601 6.488202 1716870 89040.648221
29 Puerto Rico 3250219 68.902163 20.951027 9.994013 0.152797 17.759063 1025544 20779.390244
30 Iowa 3135234 79.972449 13.424958 6.382757 0.219835 4.053189 1605748 62719.079955
31 Utah 3126408 80.461770 14.519583 4.831265 0.187381 3.727137 1527902 69376.398340
32 Nevada 3020135 82.456713 11.917158 5.451441 0.174688 7.040104 1420817 64141.641791
33 Arkansas 2982323 78.617255 15.193865 5.985022 0.203858 5.456420 1300413 46117.698745
34 Mississippi 2944501 75.787321 18.464377 5.467897 0.280405 7.655397 1232503 42658.184066
35 Kansas 2872773 77.982696 15.709994 6.077876 0.229434 4.193154 1433527 59156.324717
36 New Mexico 2033606 70.956457 22.406685 6.427555 0.209303 6.969788 869553 47735.769608
37 Nebraska 1910026 79.510919 14.079799 6.195103 0.214179 3.504432 994027 62234.540856
38 Idaho 1737264 77.926416 14.561992 7.269294 0.242297 4.368869 811812 56274.102222
39 West Virginia 1719527 76.111406 19.528111 4.225454 0.135028 6.823879 717627 47934.045872
40 Hawaii 1409701 72.569058 20.335576 6.874436 0.220931 4.768529 672157 76795.358974
41 New Hampshire 1353111 80.064972 13.140775 6.632165 0.162088 3.901491 729234 81356.851064
42 Maine 1333010 77.959227 13.448993 8.421250 0.170530 4.194865 672608 58251.154242
43 Rhode Island 1053000 82.020058 12.932421 4.887618 0.159903 5.854532 533450 81764.101449
44 Montana 1043126 74.901704 16.324044 8.476532 0.297720 4.295565 512226 57263.481481
45 Delaware 959988 80.270834 15.210649 4.327359 0.191159 6.192751 454596 71671.214286
46 South Dakota 847852 76.822194 15.055825 7.813505 0.308476 3.423900 436987 61790.978947
47 North Dakota 734129 76.557265 15.906777 7.241766 0.294192 3.151656 391921 68542.468852
48 Alaska 695239 67.720240 25.250534 6.780742 0.248483 7.634638 329197 61017.500000
49 District of Columbia 678574 70.770899 24.825811 4.264477 0.138813 7.755796 371722 95414.714286
50 Vermont 617318 76.957786 14.563229 8.263345 0.215640 3.897474 326006 65786.290456
51 Wyoming 563895 72.264148 20.844453 6.503834 0.387565 4.661195 278405 65843.103774
In [60]:
df_map = df_map.round(1)
In [61]:
# Creating new column of State Code based on State
df_map['State Code'] = df_map['State'].replace(us_state)
In [62]:
# To display the top five states by income
df_map.sort_values(by=['Income','PrivateWork'], ascending=False).iloc[:,[1,2,3,8]].head()
Out[62]:
TotalPop PrivateWork PublicWork Income
10 8607158 81.5 13.8 96061.2
49 678574 70.8 24.8 95414.7
18 5842390 73.2 21.6 92802.8
16 6658031 81.8 12.4 90983.5
28 3373552 80.6 13.1 89040.6
In [63]:
# To create a new column to display the hover text
df_map['text'] = 'State: ' + df_map['State'].astype(str) + '<br>' + 'Private Work: ' + df_map['PrivateWork'].astype(str)+  '%'+ '<br>' + 'Public Work: ' + df_map['PublicWork'].astype(str)+ '%'+ '<br>' + 'Self Employed: ' + df_map['SelfEmployed'].astype(str) + '%'+ '<br>' + 'Family Work: ' + df_map['FamilyWork'].astype(str) + '%'+ '<br>' +  'Unemployment: ' + df_map['Unemployment'].astype(str) +'%'
In [64]:
# To plot the map to show the Work Distribution 
fig = go.Figure(data = go.Choropleth(
    locations = df_map['State Code'],
    z = df_map['Income'],
    locationmode = "USA-states",
    colorscale = "turbo",
    colorbar_title = "Income Distribution (in $)",
    text = df_map['text']))

fig.update_layout(
    title={
        'text': "Distribution of Median Household Income and type of work across US States",
        'y':0.9,
        'x':0.4,
        'xanchor': 'center',
        'yanchor': 'top'},
    geo_scope = "usa"
    )


fig.show()


Some examples of private sector work: plumbers, technicians, contractors, developers, and designers as sole proprietorships; partnerships in the legal, accounting, tax, and dentistry fields; and privately held corporations in hospitality, leisure, retail, and food.


Looking at the three states with the highest incomes, it is widely assumed that private sector jobs raise the median household income of the entire state. We've come to find out if that's correct.

Observation:¶

A map of the United States lets us see the varying levels of work distribution across the states alongside their median earnings. Hovering over each state displays its statewide median income, its workforce proportions in the private, public, and self-employed sectors, the share engaged in family work, and the proportion of people who remain unemployed.

Inference:¶

It is clear that states with more public sector workers have higher median household incomes than states with fewer public workers. However, states with significantly fewer private sector workers still seem to trail behind the majority. There could be several reasons for this. To start with, increasing public sector employment can be an effective short-term method of reducing unemployment and providing a stabilizing effect. Employment in the public sector can stimulate demand in other sectors of the economy (for example, private services) and helps implement equitable policies, such as encouraging marginalized and/or disadvantaged groups to work. The US map above shows that in states such as Maryland and Virginia, and even in the District of Columbia, public sector employment is higher, even though private sector employment remains nearly four times larger. Naturally, a common inference would be that a state with more public sector workers does better and has a higher median household income, and the map bears this out.

In [65]:
# Correlation of income with the employment-type shares
df_corr_work = df_final[['Income','PrivateWork', 'PublicWork', 'SelfEmployed','FamilyWork']]

df_corr_work.corr()
Out[65]:
Income PrivateWork PublicWork SelfEmployed FamilyWork
Income 1.000000 0.231417 0.215182 0.212151 0.129838
PrivateWork 0.231417 1.000000 0.865973 0.891819 0.546711
PublicWork 0.215182 0.865973 1.000000 0.788791 0.488799
SelfEmployed 0.212151 0.891819 0.788791 1.000000 0.536307
FamilyWork 0.129838 0.546711 0.488799 0.536307 1.000000

Q2: Do the varying levels of education influence a state’s median income?¶

To understand what drives a state's income, we look at the education levels of the most educated and least educated states and compare their income levels.

In [66]:
df_edu = df_final[['High School Graduate','Some College or Associates degree', 'Bachelors degree or higher','TotalPop', 'Income', 'State']]
In [67]:
edu1_state= df_final.groupby(['State'],as_index=False)['High School Graduate'].mean().sort_values("High School Graduate", ascending = False)
In [68]:
edu2_state= df_final.groupby(['State'],as_index=False)['Some College or Associates degree'].mean().sort_values("Some College or Associates degree", ascending = False)
In [69]:
edu3_state= df_final.groupby(['State'],as_index=False)['Bachelors degree or higher'].mean().sort_values("Bachelors degree or higher", ascending = False)
In [70]:
dup_edu= edu1_state.merge(edu2_state,on='State').merge(edu3_state,on='State')
In [71]:
dup_edu.head()
Out[71]:
State High School Graduate Some College or Associates degree Bachelors degree or higher
0 West Virginia 32.846588 5.021604 11.094666
1 Pennsylvania 30.325949 6.780703 17.204514
2 Arkansas 27.938707 5.074999 11.678701
3 Missouri 27.461924 5.610955 14.098444
4 Ohio 27.443460 6.476209 14.821157
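The three per-state education summaries were combined above with chained merges on the shared `State` key. A minimal sketch with illustrative frames:

```python
import pandas as pd

# Three toy per-state summaries sharing a 'State' key column
a = pd.DataFrame({"State": ["X", "Y"], "HighSchool": [30.0, 25.0]})
b = pd.DataFrame({"State": ["X", "Y"], "SomeCollege": [6.0, 7.0]})
c = pd.DataFrame({"State": ["X", "Y"], "Bachelors": [15.0, 20.0]})

# Chained inner merges line the columns up side by side, one row per state
merged = a.merge(b, on="State").merge(c, on="State")
print(merged.columns.tolist())  # ['State', 'HighSchool', 'SomeCollege', 'Bachelors']
```

Because `merge` defaults to an inner join, only states present in all three frames survive, which is the desired behavior here.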

To understand what drives a state's income, we first look at the education level within the state; the second part is to see whether minority groups have received the same level of education. We will examine racially balanced states to see whether their education levels vary and, if so, how this affects income.

In [72]:
fig = px.bar(dup_edu, x="State", y=['High School Graduate', 'Some College or Associates degree', 'Bachelors degree or higher'], title="Education Distribution across States", width=1100, height=600)
fig.update_layout(barmode='stack', xaxis={'categoryorder': 'total descending'})

fig.update_layout(
    title={
        'text': "Education Distribution across States",
        'y':0.9,
        'x':0.4,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="State",
    yaxis_title="Percentage of total population",
    legend_title="Education Level")
fig.show()

Observation:¶

The stacked bar chart is straightforward. It displays the education levels of each state and the proportion of its population that is educated. Furthermore, the chart is arranged in descending order, from the state with the most educated population to the state with the least.

Inference:¶

The bar chart shows how education levels vary across states, and income tends to move in step with them. We conclude that education is closely tied to income levels.

Q3 - Can there be a model to predict the median household income of a state in the US using factors such as education, employment, and type of work, which are known to influence income?¶

By creating a machine learning model that makes use of the above variables, we will learn how changes in those variables affect the median household income of a state in the US.

Further, combining the Community Needs Index with our model would give companies information on which geographical areas to work in and which factors to work on.

Analysing variables that have high correlation with income.

In [73]:
df_final.corr().iloc[:,0]*100
Out[73]:
Income                               100.000000
Professional                          37.425576
Service                                7.523039
Office                                21.133021
Construction                          13.870081
Production                             1.433382
Drive                                 22.405340
Carpool                               11.804959
Transit                                8.755939
Walk                                   6.798403
OtherTransp                           10.939322
WorkAtHome                            38.836495
Employed                              18.149466
PrivateWork                           23.141678
PublicWork                            21.518222
SelfEmployed                          21.215105
FamilyWork                            12.983791
Unemployment                           3.597943
TotalPop                              18.033850
Men                                   18.286431
Women                                 17.743456
Hispanic                              -0.061469
White                                 23.210477
Black                                 -6.816756
Citizen                               18.715275
Housing with Mortgage                 32.528122
Median Monthly housing costs          74.232353
High School Graduate                 -46.421442
Some College or Associates degree      1.172333
Bachelors degree or higher            66.420682
RaceOther                             23.615976
Name: Income, dtype: float64
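As a minimal, self-contained illustration of this screening step (using a synthetic DataFrame in place of df_final, which is built earlier in the notebook), features can be filtered by the magnitude of their correlation with income:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
income = rng.normal(60000, 10000, n)

# Synthetic stand-ins for a few census features (illustrative only)
df = pd.DataFrame({
    "Income": income,
    "Bachelors degree or higher": income * 0.0004 + rng.normal(0, 2, n),
    "High School Graduate": -income * 0.0003 + rng.normal(0, 2, n),
    "Service": rng.normal(10, 3, n),  # pure noise, unrelated to income
})

# Correlation of every column with Income, expressed as a percentage
corr = df.corr()["Income"] * 100

# Keep features whose absolute correlation exceeds a chosen cutoff
strong = corr.drop("Income")[corr.drop("Income").abs() > 30]
print(strong.sort_values(ascending=False))
```

The cutoff (30% here) is a tunable choice; in the notebook we inspect the full correlation vector rather than hard-filtering.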
In [74]:
# Importing libraries

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import explained_variance_score,mean_absolute_error,r2_score,mean_absolute_percentage_error

# Dropping irrelevant columns for training the model

x_features = df_final.drop(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',"City","State","RaceOther"], axis=1)
y_output = df_final['Income']


# Dividing the data into training and testing datasets
x_train, x_test, y_train, y_test = train_test_split(x_features, y_output, shuffle=True)

# Building a linear regression model
model=LinearRegression()
model.fit(x_train, y_train)

# Predicting on test data
y_pred = model.predict(x_test)


# Model performance evaluation
print("\tMean absolute error:", mean_absolute_error(y_test, y_pred))
print("\tR2 score:", r2_score(y_test, y_pred))
print("\tMean absolute percentage error:", mean_absolute_percentage_error(y_test, y_pred))

plt.figure()
ax1 = sns.kdeplot(y_test, color="r", label="Actual Value")
sns.kdeplot(y_pred, color="b", label="Predicted Values", ax=ax1)

plt.title('Actual (Red) vs Predicted (Blue) Income')
plt.xlabel('Income')
plt.ylabel('Density')
plt.legend()
plt.show()
plt.close()
	Mean absolute error: 9668.872091027031
	R2 score: 0.6912387735814876
	Mean absolute percentage error: 0.17908434488282726
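For reference, the last number printed above is the mean absolute percentage error. A quick hand-rolled check on toy incomes (not the notebook's data) confirms the definition sklearn uses:

```python
import numpy as np
from sklearn.metrics import mean_absolute_percentage_error

# Toy actual and predicted incomes (illustrative numbers only)
y_true = np.array([50000.0, 60000.0, 70000.0])
y_hat = np.array([55000.0, 57000.0, 63000.0])

# MAPE = mean(|actual - predicted| / |actual|)
manual = np.mean(np.abs(y_true - y_hat) / np.abs(y_true))
print(manual)
print(mean_absolute_percentage_error(y_true, y_hat))  # same value
```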

Model Evaluation:¶

Based on the performance evaluation parameters for the model, we obtain the following scores:

1. R2 (R-squared) Value: 0.69
2. Mean Absolute % Error: 17.9%

Inference:¶

These values suggest that our model predicts the income with an error of roughly 17.9%.
We can leverage this model to predict what the median household income will be for a given state.

Working on our Model:¶

For our analysis, we have decided to sample the CNI for Alabama, as the average median income for the state lies in the range of $50,000 to $60,000, which shows the highest variance in the graph of our model above.

In [75]:
df_AL = df_final[df_final['State']=='Alabama']
In [77]:
df_AL = df_AL.drop(['Income', 'IncomeErr', 'IncomePerCap', 'IncomePerCapErr',"City","State","RaceOther"], axis=1)
In [78]:
df_AL
Out[78]:
Professional Service Office Construction Production Drive Carpool Transit Walk OtherTransp ... Women Hispanic White Black Citizen Housing with Mortgage Median Monthly housing costs High School Graduate Some College or Associates degree Bachelors degree or higher
ZipCode
35004 2112 857 1357 503 843 5228 272 0 8 0 ... 6245 0 10142 1303 9254 2509 1122 27.670797 7.560281 14.057267
35005 1118 455 1352 174 707 3242 330 0 18 3 ... 4869 43 4299 4167 6872 1490 1105 33.783324 4.049915 8.893931
35006 132 216 609 66 135 1204 92 0 11 0 ... 1565 0 3061 103 2600 446 823 37.381703 3.091483 4.416404
35007 5601 2055 3360 811 1196 10720 1349 0 16 72 ... 13910 1514 19291 4680 18950 5632 1234 15.695964 6.325915 23.353452
35010 2127 1600 1851 590 2070 7031 1071 9 17 40 ... 10464 712 13162 6452 15492 2558 994 24.742676 5.289984 11.411322
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
36915 96 0 19 6 33 158 79 0 0 0 ... 420 0 496 148 533 30 944 26.046512 3.565891 11.007752
36916 79 19 78 43 40 260 32 0 3 1 ... 502 0 487 543 770 50 1125 22.427184 15.048544 4.854369
36919 187 130 81 29 114 455 57 0 11 10 ... 881 0 869 836 1337 147 887 29.232572 7.498535 5.155243
36921 91 45 57 36 48 254 28 0 0 0 ... 447 0 768 46 783 131 1034 30.712531 11.670762 7.002457
36925 250 321 277 20 182 987 31 0 0 0 ... 1597 70 286 2545 2233 321 773 23.134583 5.144804 9.982964

561 rows × 29 columns

In [79]:
y_pred_al = model.predict(df_AL)
In [80]:
y_pred_al.mean()
Out[80]:
52782.82952864726
In [81]:
income_state[income_state['State']=='AL']
Out[81]:
State Income
45 AL 49457.13369

From the above analysis, we see that the predicted income for Alabama is $52,783, whereas the actual median income is $49,457. This shows that Alabama has scope to improve its income.

Now that we have a model that can tell us if a state's median household income can be increased, we must determine which areas need to be improved and in what order. That is exactly what our upcoming analysis will reveal.
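The improvement gap used above can be expressed as a small helper; the figures below are taken from the two Alabama cells:

```python
def improvement_headroom(predicted: float, actual: float) -> float:
    """Percent by which a state's income could rise to reach the model's prediction."""
    return (predicted - actual) / actual * 100

# Predicted mean income vs actual median income for Alabama (from the cells above)
headroom = improvement_headroom(52782.83, 49457.13)
print(round(headroom, 1))  # ≈ 6.7
```

A positive headroom flags a state with scope for improvement; a negative value would indicate the state already outperforms the model's expectation.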

Q4. Community Needs Index¶

What are the factors that determine what the needs of the people of a state are? How does the state's needs compare to the national average in terms of education, modes of transport, types of jobs or housing costs?¶

In this question, we are devising a methodology to compare living conditions in different states.
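A minimal sketch of that comparison, using illustrative factor values rather than the notebook's final_cni data: each factor is expressed as a percent deviation from the national value, so negative numbers flag needs.

```python
import pandas as pd

# Toy factor values: rows are geographies, columns are the four CNI factors
factors = pd.DataFrame(
    {"Education": [60.0, 50.0], "Jobs": [55.0, 45.0],
     "Transport": [40.0, 20.0], "Housing": [30.0, 25.0]},
    index=["National", "Alabama"])

# Percent deviation of the state from the national value, per factor:
# negative values mark factors where the state lags the country.
cni = (factors.loc["Alabama"] - factors.loc["National"]) / factors.loc["National"] * 100
print(cni)
```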

In [82]:
# We will be leveraging the data calculated for the CNI during our processing. The data was consolidated into 4 main factors.
In [83]:
# Filtering out the information for Alabama, the state selected for our analysis
final_cni_Alabama = final_cni.filter(items=['Alabama'], axis=0)
final_cni_Alabama

cni_df_sum_transposed = final_cni_Alabama.T
cni_df_sum_transposed.columns.name = 'Factors'
In [84]:
#Plotting the CNI

cni_plot = cni_df_sum_transposed

## To add the column with colors
cni_plot["Color"] = np.where(cni_plot["Alabama"]<0, 'red', 'green')

# To plot the map
fig = go.Figure()
fig.add_trace(
    go.Bar(name='Factors',
           x=cni_plot.index.values,
           y=cni_plot['Alabama'].values,
           marker_color=cni_plot['Color']))

# To update the title of the Map
fig.update_layout(
    title={
        'text': "CNI for Alabama",
        'y':0.9,
        'x':0.45,
        'xanchor': 'center',
        'yanchor': 'top'},
    xaxis_title="Factors",
    yaxis_title="Difference(%) compared to national value"
)

fig.update_layout(barmode='stack')
fig.show()

The graph above shows the factors in which Alabama falls below the national mean. The CNI verifies our theory: since Alabama's income is lower than its potential, there is scope for improvement in one or more factors, which need to be considered when looking to improve the lifestyle of the people in a state.

Conclusion:¶


For variable identification, we looked at factors (such as jobs) required by the community by assigning weights to sub-factors (such as public, private, and self-employment) and converting the absolute numbers provided for these factors into proportions in order to compare geographies. Following that, we developed a predictive model based on linear regression with a mean absolute percentage error of roughly 18% (an accuracy of about 82%). This model assists us in determining which states have room for improvement in terms of income: if our predicted value is higher than the current average income, we conclude that the state's conditions can be bettered.
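The weighting step described here can be sketched as follows; the weights and counts are illustrative, not the ones used in our processing:

```python
import pandas as pd

# Hypothetical sub-factor weights for the "Jobs" factor (illustrative only)
weights = pd.Series({"PrivateWork": 0.5, "PublicWork": 0.3, "SelfEmployed": 0.2})

# Absolute counts for two toy states
counts = pd.DataFrame({"PrivateWork": [800, 400],
                       "PublicWork": [150, 300],
                       "SelfEmployed": [50, 300]},
                      index=["StateA", "StateB"])

# Convert absolute numbers to proportions per state, then combine with weights
shares = counts.div(counts.sum(axis=1), axis=0)
jobs_factor = (shares * weights).sum(axis=1)
print(jobs_factor)
```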

The Community Needs Index (CNI) analysis identifies the areas that can be improved within a state. We have distilled 39 columns of data into just 4 major factors, which helps people understand the needs of a state at a broader scope while retaining the opportunity to delve into the data at the sub-factor level.

Alabama was chosen as an example because its income could increase by 6.7%. Our in-depth analysis showed that affordable housing was 18% below the national average, transportation was 43% below, and employment was 17% below.

With this in-depth analysis, we can conclude that the state of Alabama, or any philanthropic organization, should prioritize transportation before addressing other issues.

Takeaways for organizations (business outcome):¶


Aside from the obvious interest that NGOs and governments have in the CNI, the ideal clients are for-profit organizations that want to improve their CSR activities. Big organizations take responsibility for giving back to the community, and it is common to see them construct parks, schools, and overhead bridges, and restore landscapes that benefit the environment, among many other initiatives.
With this model, any organization can easily understand a community's true needs, focusing on the pain points at the state level. This assists businesses in achieving accurate results and, in turn, earning praise and brand recognition.

Future iterations of the model and its scope:
Now that we have a model that works at the state level, we can replicate it down to the zip code level. As a result, even smaller organizations and philanthropists can help uplift communities around them beyond their own capacity.

References:¶

[1] https://data.census.gov/
[2] https://papers.ssrn.com/sol3/papers.cfm?abstract_id=868491